1 Introduction

Why we care about this topic and what we would like to learn

2 Exploratory Data Analysis

2.1 Who Are the Soldiers?

Survey 32 was administered to soldiers in 1943, approximately 5 years before the military was integrated. It was given to 7442 black soldiers and 4793 white soldiers and asked for basic demographic information, career aspirations, and more. Of most interest to us, Survey 32 asked the soldiers for their opinions on integrating military outfits. Our questions of interest concern age, education, enlistment, state, community type, and opinions on outfits; on the survey these were Questions 1, 2, 3, 13, 14, and 77 (63 for white soldiers), respectively. We also looked at questions about the soldiers' thoughts on the future and how black rights and treatment would change after the war.

2.1.1 Age

Age was not collected on a continuous scale; it was discretized into a few age groups. The overwhelming bulk of the black soldiers who were surveyed were 20 years old, with a small portion 19 or younger. Meanwhile, the white soldiers' ages were more spread out, with most between 21 and 24.

2.1.2 Education

Turning to education, the black soldiers again show little spread. Remarkably, all of the black soldiers surveyed had less than a 5th grade education at the time. Meanwhile, the bulk of the white soldiers had completed high school or some high school.

When we overlay the distribution of education levels with age ranges, we see that older white soldiers made up a larger proportion of the white soldiers with less education than of those with some high school. Soldiers between 21 and 24 with a high school education make up the largest contingent of white soldiers when grouped by education and age.

2.1.3 Enlistment

Something interesting arises here: the vast majority of the black soldiers volunteered to join the military, whereas about 3/4 of the surveyed white soldiers were drafted. The remaining white soldiers were mostly volunteers, with a few from the National Guard.

2.1.4 Location

As expected, most of the soldiers hailed from the most populous states at the time. White soldiers were mostly from Illinois, Pennsylvania, New York, Texas, and Michigan, while black soldiers were mostly from Texas, New York, Illinois, Pennsylvania, and Ohio. Note that the top 4 states for white soldiers had similar numbers of soldiers, but there was a severe drop-off in the representation of black soldiers from other states after Texas and New York.

2.1.5 Communities

As expected, soldiers whose home communities were large cities had the most representation in both groups. Beyond large cities, white soldiers came in roughly equal numbers from farms, towns, and cities, with slightly fewer from cities. For black soldiers, on the other hand, cities had the next-largest representation, followed by farms and towns, which contributed approximately equally.

We see that more educated soldiers came in larger proportions from communities with larger populations.

2.1.6 Integrating Outfits

Our key variable of interest from this survey is the soldiers' opinions on integrating their outfits. As expected, the vast majority of white soldiers were against integration; the black soldiers, however, seem to be divided on whether they wanted integration. They were roughly evenly split between keeping outfits separated and integrating them, and a good number were undecided or indifferent.

If we look at the age composition of the soldiers who chose each response, we see that the proportions are relatively stable across all opinions on integration.

Overlaying the education distribution on the integration opinions shows something more interesting. The white soldiers who voted for the outfits to be together skew towards being more educated. In fact, over 50% of the soldiers who voted for integrated units had at least finished high school. This is not the case for any of the other responses.
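The comparison above is a conditional proportion: within each opinion group, what share of respondents falls at each education level. A minimal sketch of that calculation follows, assuming the responses are available as (opinion, education) pairs. The records and category labels are hypothetical toy values chosen for illustration, not the Survey 32 data.

```python
from collections import Counter, defaultdict

# Hypothetical toy records (opinion, education); NOT the Survey 32 data.
records = [
    ("together", "high school"),
    ("together", "college"),
    ("together", "grade school"),
    ("separate", "grade school"),
    ("separate", "grade school"),
    ("separate", "high school"),
    ("undecided", "grade school"),
    ("undecided", "high school"),
]


def education_shares(rows):
    """Return, for each opinion, the share of respondents at each education level."""
    counts = defaultdict(Counter)
    for opinion, education in rows:
        counts[opinion][education] += 1
    shares = {}
    for opinion, by_edu in counts.items():
        total = sum(by_edu.values())
        shares[opinion] = {edu: n / total for edu, n in by_edu.items()}
    return shares


shares = education_shares(records)

# Share of each opinion group with at least a high school education.
at_least_hs = {
    opinion: by_edu.get("high school", 0) + by_edu.get("college", 0)
    for opinion, by_edu in shares.items()
}
for opinion, share in sorted(at_least_hs.items()):
    print(f"{opinion}: {share:.2f}")
```

Normalizing within each opinion group, rather than across the whole sample, is what allows a statement such as "over 50% of those who chose integration" to be compared across responses of very different sizes.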

Across both races, a greater portion of those who chose integration were from large cities, and soldiers from more populated communities voted for separation proportionally less often.

2.1.7 Thoughts on the future

The majority of the white soldiers believed that their rights would not change after the war, and roughly equal numbers thought they would increase or decrease. About 40% of the black soldiers thought their rights would increase following the war; a slightly smaller share expected no change at all. Interestingly, the black soldiers' answers to whether black people would have more rights after the war were nearly identical, but here more white soldiers thought black people would get more rights. The majority of black soldiers thought that white people would treat them the same after the war, but about 30% were optimistic that they would receive better treatment.

2.2 Differences in Sets - Non-Stemmed

2.3 Differences in Sets - Stemmed

2.3.1 Long Responses

2.3.2 Short Responses

3 Sentiment Analysis

3.1 Differences in Sets

4 Social Network Analysis

4.1 Social Networks with Unionized Terminology

4.1.1 Long Responses

4.1.2 Short Responses

4.2 Identifying Topics

4.2.1 Topic Model Networks

Put simply, a topic model identifies the topics in a body of text and the words associated with each topic. Naturally, a word may fall in multiple topics, and the model accounts for this by giving each topic a probability distribution over the words. A Topic Model Network is a useful way to visualize the topics and the words associated with each topic. Here we will explore two different topic models.
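To make the idea concrete, the sketch below represents each topic as a probability distribution over words and turns those distributions into weighted topic-word edges, which is one simple way a topic model network could be assembled. The topic names, words, probabilities, and the `min_prob` cutoff are all hypothetical and are not fitted to the survey responses.

```python
# Hypothetical topic-word probabilities; illustrative only, not fitted to any data.
topics = {
    "topic_0": {"army": 0.5, "train": 0.3, "home": 0.2},
    "topic_1": {"home": 0.6, "family": 0.3, "army": 0.1},
}


def topic_network_edges(topic_word, min_prob=0.15):
    """Return (topic, word, weight) edges for words with probability >= min_prob."""
    edges = []
    for topic, word_probs in topic_word.items():
        # Each topic is a probability distribution, so its weights should sum to 1.
        assert abs(sum(word_probs.values()) - 1.0) < 1e-9
        for word, prob in word_probs.items():
            if prob >= min_prob:
                edges.append((topic, word, prob))
    return edges


edges = topic_network_edges(topics)

# Words linked to more than one topic are what connect topic nodes in the network.
word_topics = {}
for topic, word, _ in edges:
    word_topics.setdefault(word, []).append(topic)
shared_words = sorted(w for w, ts in word_topics.items() if len(ts) > 1)
print(shared_words)
```

In this toy example "home" is retained in both topics, so it would appear as a bridge between the two topic nodes, while "army" falls below the cutoff in the second topic and is linked only to the first.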

4.2.2 Latent Dirichlet Allocation

Latent Dirichlet Allocation, or LDA, is the typical go-to method for topic modelling. The first network here displays num_clusters topics for the black soldiers' responses to the long comment.

4.2.3 BTM

LDA has some drawbacks for our dataset; namely, it does not handle short texts well. That is why we also implemented a Biterm Topic Model, developed by [citation needed], which does better on short texts.
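The central idea behind a Biterm Topic Model is to model biterms, unordered pairs of words that co-occur in the same short text, pooled across the whole corpus rather than within each document. The sketch below shows only the biterm extraction step, assuming each response has already been tokenized; the function name `extract_biterms` and the example tokens are hypothetical, and this is not the implementation used to produce the networks here.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return all unordered word pairs from one short text, each pair sorted."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


# Hypothetical tokenized short responses; NOT actual survey text.
docs = [
    ["want", "go", "home"],
    ["go", "home", "soon"],
]

# Pool biterms over the whole corpus, which is what the model is fit on.
biterm_counts = Counter()
for doc in docs:
    biterm_counts.update(extract_biterms(doc))

print(biterm_counts.most_common(1))
```

Because the counts are pooled over all responses, a short text with only a few words still contributes co-occurrence evidence, which is the usual argument for why this approach copes better with short responses than a per-document model.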

The models were fit with 2000 Gibbs sampling iterations each.

networks in the context of networks

5 Conclusion

what we learned, why it matters